Abstract

Ordinal data are often seen in real applications. Regular multicategory classification 
methods are not designed for this data type and a more proper treatment is needed. 
We consider a framework of ordinal classification which pools the results from binary 
classifiers together. An inherent difficulty of this framework is that the class prediction 
can be ambiguous due to boundary crossing. To fix this issue, we propose a noncrossing
ordinal classification method which materializes the framework by imposing
noncrossing constraints. An asymptotic study of the proposed method is conducted.
We show by simulated and data examples that the proposed method can improve the
classification performance for ordinal data without the ambiguity caused by boundary
crossings.
1 Introduction


Data with ordinal class labels are very common in practice and are collected in many
scientific areas and social practices, such as disease diagnosis and prognosis, national
security threat detection, and quality control. For example, the development of a tumor can be
classified as Stage I, Stage II, Stage III, etc.; the U.S. homeland security advisory system
has five categories, Green, Blue, Yellow, Orange and Red, ordered from the least to the most
severe threats; the quality of a randomly sampled product can be categorized as excellent,
good, fair or bad. The goal of ordinal classification is to classify a data point to one of
these ordinal categories, $y \in \mathcal{Y}$, based on the covariates $x \in S \subset \mathbb{R}^d$. Here we consider the
case $\mathcal{Y} = \{1, 2, \ldots, K\}$. The actual labels are of no importance, so long as the order can be
recognized.

Note that ordinal data are a special case of the more general multicategory data. Ignoring
the order information, one may classify ordinal data in the same way as one would
multicategory data, by applying a multicategory classification method. There is a large body
of literature for the latter. This includes methods which combine multiple binary classifiers,
such as the One-Versus-One and One-Versus-Rest paradigms (see for example Duda et al.,
2001), and those which estimate multiple classification boundaries simultaneously, such as
Weston and Watkins (1999), Crammer and Singer (2002), Lee et al. (2004), and Huang
et al. (2013). While using a multicategory classification method for ordinal data sometimes
works, such treatment can be suboptimal, because the classes are treated equally without
their connections and relative superiority being considered. Moreover, a counterexample
in Section 2 reveals that it is desirable to use an approach which fully utilizes the ordinal
information available.

Suppose there are $K$ classes in total. A simple but very useful strategy for ordinal
classification is to sequentially conduct binary classifications between the combined
meta-class $C_k = \{1, \ldots, k\}$ and meta-class $\bar{C}_k = \{k + 1, \ldots, K\}$, for $1 \le k \le K - 1$, and then pool
the classification results from these $(K - 1)$ steps to reach a final prediction (see Frank and
Hall, 2001). In binary classification, usually a discriminant function $f$ is estimated, and a data
point $x$ is classified to the positive class if $f(x) > 0$, or to the negative class otherwise. The
classification boundary is defined by $\{x : f(x) = 0\}$. As there are $(K - 1)$ binary classifiers
in this strategy, there are $(K - 1)$ classification boundaries. This approach assumes that
each class is sandwiched by two adjacent classification boundaries.

An inherent difficulty of this approach is that since these boundaries are trained separately,
it is possible that they cross each other. Consequently, how to reach a final
conclusion becomes ambiguous for some data points.

In this article, we propose a flexible margin-based classification method for ordinal data.
The direction we pursue is to construct the $(K - 1)$ boundaries simultaneously. Our method
is equipped with extra noncrossing constraints to fix the crossing issue, and hence is named
Noncrossing Ordinal Classification (NORDIC). Similar noncrossing constraints were studied
and used in the quantile regression context (for example, Bondell et al., 2010, Liu and
Wu, 2011). Compared to the vanilla idea of training $(K - 1)$ binary classifiers separately,
simultaneous learning can borrow strength from different classes, which leads to better
classification accuracy and improved robustness to mislabeled data. Moreover, compared
to many existing methods, our method is more flexible, since it does not assume that the
boundaries are parallel.

Among the existing related work in classifying ordinal data, Herbrich et al. (2000) tried
to find the classification boundaries by maximizing the margin in the space of pairs of
data vectors; Frank and Hall (2001) were among the first to consider the idea of pooling
binary classifiers; Shashua and Levin (2003) generalized the support vector formulation for
ordinal regression and proposed to optimize multiple thresholds to define parallel separating
hyperplanes; Chu and Keerthi (2005) improved the work of Shashua and Levin (2003) and
guaranteed that the thresholds were properly ordered; Chu and Ghahramani (2005) used
a probabilistic kernel approach based on Gaussian processes; Cardoso and da Costa (2007)
replicated the data and cast the ordinal classification problem as a single binary classification
problem. Many of these approaches, although ensuring noncrossing, impose a fairly
strong assumption that the $(K - 1)$ boundaries are parallel to each other (either in the
original sample space or in the kernel feature space), which may lack flexibility and be
unrealistic in many cases.

The rest of the article is organized as follows. In Section 2, we compare multicategory
classification with ordinal classification, and review a simple framework for ordinal classification.
We introduce the main idea of the NORDIC method and the computation algorithm in
Section 3. A more precise version of NORDIC, which makes use of a less popular optimization
algorithm, is introduced in Section 4. The theoretical properties are studied in Section
5. Several simulated examples are used to compare NORDIC with other methods in Section
6. A real data example is studied in Section 7. Concluding remarks are made in Section 8.

2 Ordinal Classification 

In this section, we first demonstrate, using a real example, that in some cases it is better not
to ignore the ordinal information by treating ordinal data as regular multicategory data. We
then introduce a framework of ordinal classification via binary classifiers. Lastly we compare
the principles of multicategory and ordinal classification.

2.1 An Example in U.S. Presidential Election 

In a multicategory classifier with $K$ classes, usually $K$ discriminant functions $g_k(x)$, $k =
1, \ldots, K$, are estimated and the class prediction for $x$ is $\hat{\phi}(x) = \arg\max_{k \in \{1, \ldots, K\}} g_k(x)$. Let $\eta_k(x)$
denote the conditional probability for the $k$th class, $\eta_k(x) = P(Y = k \mid X = x)$. In
this case, any multicategory classifier would aim to mimic the Bayes classification rule,
$\phi_{\mathrm{Bayes}}(x) = \arg\max_{k \in \{1, \ldots, K\}} \eta_k(x)$, which has the smallest conditional classification risk,
$P(\phi(X) \ne Y \mid X = x)$, among all possible rules.

For ordinal data, one can opt to ignore the ordinal information and classify them using
a multicategory classifier. However, a counterexample suggests that this may not always be
a wise strategy. Consider the presidential election in the United States. Any voter can be
viewed as being from a red state (a state which is most conservative and predominantly votes
for the Republican Party), a blue state (a state which is least conservative and predominantly
votes for the Democratic Party) or a purple state (also known as a swing state, where both
parties receive strong support). In 2012, the states of North Carolina, Florida, Ohio, and
Virginia were the swing states. There are many more blue and red states in the U.S. than
swing states (and a much larger population in the former two types of states than in
the latter). Suppose each voter is associated with a covariate vector $x \in S$ and the color
of her home state is the class label. The statistical task here is to classify her to one of the
three types of states, $\mathcal{Y} = \{\text{red}, \text{purple}, \text{blue}\}$.

Recall that the Bayes rule in multicategory classification classifies $x$ to the class with
the greatest $\eta_k(x)$. It is more likely for a multicategory classifier to classify a voter to
a blue state or a red state, since both tend to have larger $\eta_k(x)$. To see this, note that
$\eta_k(x) = \pi_k d_k(x) / \{\sum_{\ell \in \mathcal{Y}} \pi_\ell d_\ell(x)\}$, where $d_k(x)$ is the density of the covariate $X$ given that
she is from the $k$th class and $\pi_k$ is the unconditional class probability for the $k$th class.
Clearly, both $\pi_{\mathrm{red}}$ and $\pi_{\mathrm{blue}}$ are much greater than $\pi_{\mathrm{purple}}$, so that their $\eta_k(x)$'s
tend to be larger as well. The bottom line is that it seems unfair that the chance that
a voter from a purple state is correctly identified is compromised simply because there is
a smaller population in purple states. Ironically, in a U.S. presidential election, the swing
states are the most important battleground, because it is the swing states that break the
tie in a presidential campaign.
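The effect of imbalanced priors on the posterior $\eta_k(x)$ can be seen in a small numerical sketch. The priors and density values below are hypothetical numbers chosen only for illustration; they are not taken from the article or from election data.

```python
def posterior(priors, densities):
    """Return eta_k(x) = pi_k * d_k(x) / sum_l pi_l * d_l(x) for every class k."""
    weighted = [p * d for p, d in zip(priors, densities)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical values: at this x the purple density is the largest,
# but the purple prior is much smaller than the red and blue priors.
classes = ["red", "purple", "blue"]
priors = [0.45, 0.10, 0.45]      # assumed pi_k
densities = [0.2, 0.5, 0.3]      # assumed d_k(x) at one fixed x

eta = posterior(priors, densities)
bayes_class = classes[eta.index(max(eta))]
print(dict(zip(classes, eta)), bayes_class)
```

Under these assumed numbers the multicategory Bayes rule picks a non-purple class even though the covariates are most typical of a purple-state voter, which is the behavior described in the paragraph above.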

In this example, the imbalanced class prior probabilities appear to be the proximate
cause that leads to the aforementioned issue. The underlying root cause, however, is that
the ordinal nature of the data has been ignored. A classification method which makes use of
the ordinal information is more appropriate in this case. We describe a simple strategy for
this example here, which leads to the more formal methodology in the next subsection: for a
randomly selected voter, we first consider classifying her to a blue state, versus a purple or
red state. If she is classified to the latter, then she tends to be relatively more conservative
(than blue state voters). We then classify her to a blue or purple state, against a red state. If
she is classified to the former, then she is relatively less conservative (than red state voters).
The results of the two comparisons can lead to the final conclusion that she is classified to
a purple state.

2.2 Ordinal Classification via Binary Classifiers 

In general, consider an ordinal classification problem with $K$ classes. Furthermore, consider
$(K - 1)$ binary classifiers, where the $k$th classification boundary separates the combined
set $\{i : y_i \in C_k\}$ from the combined set $\{i : y_i \in \bar{C}_k\}$, where $C_k = \{1, \ldots, k\}$ and
$\bar{C}_k = \{k + 1, \ldots, K\}$. For the $k$th binary classification, we code the former as the negative
class and the latter as the positive class by constructing a dummy class label $y^{(k)} = -1$
if $y \le k$ and $+1$ if $y > k$. The $k$th binary classifier is associated with a discriminant function
$f_k(x)$ so that the classification rule is $\mathrm{sign}\{f_k(x)\}$. Let $Z_k(x)$ denote the prediction
set of observation $x$ with respect to the $k$th subproblem, defined as $C_k$ if $f_k(x) < 0$, or $\bar{C}_k$
otherwise. The final prediction for $x$, aggregating all the results from the $(K - 1)$ binary
classifiers above, will be the intersection of the $Z_k(x)$, i.e., $\bigcap_{1 \le k \le K-1} Z_k(x)$.



           BC I    BC II   BC III
Class 1     ✗       ✓       ✓
Class 2     ✓       ✓       ✓
Class 3     ✓       ✗       ✓
Class 4     ✓       ✗       ✗

Table 1: An illustrative table showing the predictions of the three binary classifiers for an
observation in a four-class example. Aggregating the results of the three binary classifiers, we can reach
the final prediction that the observation is classified to the second class. BC is short for “Binary
Classifier”.

In a four-class toy example, Table 1 tabulates the predictions of the three binary classifiers
for some observation $x$. The first binary classifier compares Class 1 and the meta-class
{2,3,4}. The prediction is that the observation is from {2,3,4}. Similarly, the second
binary classifier compares {1,2} and {3,4} and the prediction is {1,2}. Lastly, the third
binary classifier classifies the observation $x$ to {1,2,3}. Clearly, Class 2 is favored by all
three binary classifiers and it is the final prediction for $x$. This framework for reaching an
ordinal classification prediction by pooling binary classifiers was first noted by Frank and
Hall (2001).
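The pooling step can be sketched in a few lines of Python. The function names and the discriminant values below are illustrative assumptions; only the rule itself (intersect the $K - 1$ prediction sets) comes from the text.

```python
def prediction_set(k, f_k_value, K):
    """Z_k(x): classes {1..k} if f_k(x) < 0, otherwise classes {k+1..K}."""
    if f_k_value < 0:
        return set(range(1, k + 1))
    return set(range(k + 1, K + 1))


def pool(f_values):
    """Intersect the K-1 prediction sets; an empty result means the
    binary classifiers disagree and no class can be assigned."""
    K = len(f_values) + 1
    result = set(range(1, K + 1))
    for k, value in enumerate(f_values, start=1):
        result &= prediction_set(k, value, K)
    return result


# Assumed discriminant values reproducing the pattern of Table 1:
# BC I says {2,3,4}, BC II says {1,2}, BC III says {1,2,3}.
table1 = pool([0.8, -0.3, -1.2])

# Assumed values with inconsistent signs: {2,3,4}, {1,2} and {4}.
inconsistent = pool([0.8, -0.3, 0.5])
print(table1, inconsistent)
```

The second call returns an empty set, which is the kind of ambiguity that arises when separately trained boundaries are not consistent with one another.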

2.3 Principle of Ordinal Classification 

We are now ready to compare the principles of multicategory classification and ordinal
classification. A cartoon in Figure 1 demonstrates the distinction between these
principles. In a data set with $K = 4$, there are two example data points (shown in the top
and the bottom rows respectively). For each data point, the length of each block denotes the
conditional class probability $\eta_k(x)$. The sum of all four conditional probabilities is 1. The
principle in multicategory classification chooses Class 1 in the top example and Class 4 in
the bottom example, as they correspond to the greatest $\eta_k(x)$ in both cases. In contrast, in
ordinal classification, the desired prediction would be Class 2 and Class 3 respectively. For
example, for the top example, the data point is more likely from Class {1,2} than from Class
{3,4}, and more likely from Class {2,3,4} than from Class {1}. Hence Class 2 is the most
plausible choice for this data point. Similarly, the data point in the bottom is most likely
from Class 3. In particular, they both correspond to Class $k$ such that $\sum_{\ell=1}^{k-1} \eta_\ell(x) < 1/2$
and $\sum_{\ell=1}^{k} \eta_\ell(x) \ge 1/2$ for each $x$. In the cartoon, a vertical line corresponding to 0.5 cuts
the blocks for the desired predictions.

Figure 1: In the top panel, the multicategory classification principle chooses Class 1 while the
ordinal classification principle chooses Class 2. In the bottom panel, the multicategory classification
principle chooses Class 4 while the ordinal classification principle chooses Class 3.

A useful notion here is that the principle of multicategory classification is to select the
‘mode’ of the class labels, based on $\eta_k(x)$, while that of ordinal classification is to select
the ‘median’.
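The mode-versus-median distinction can be made concrete with a short sketch. The probability vectors are assumed values chosen to mimic the two panels of Figure 1; the article does not report the actual block lengths.

```python
def mode_rule(eta):
    """Multicategory principle: the class with the largest eta_k(x)."""
    return 1 + max(range(len(eta)), key=lambda k: eta[k])


def median_rule(eta):
    """Ordinal principle: the smallest class k whose cumulative
    probability eta_1(x) + ... + eta_k(x) reaches 1/2."""
    cumulative = 0.0
    for k, p in enumerate(eta, start=1):
        cumulative += p
        if cumulative >= 0.5:
            return k
    return len(eta)  # guard against rounding when the total is just below 1


top = [0.4, 0.3, 0.2, 0.1]      # assumed eta for the top panel
bottom = [0.1, 0.2, 0.3, 0.4]   # assumed eta for the bottom panel
print(mode_rule(top), median_rule(top))
print(mode_rule(bottom), median_rule(bottom))
```

With these assumed values the mode rule selects Classes 1 and 4 while the median rule selects Classes 2 and 3, matching the predictions described for the two panels.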


3 Noncrossing Ordinal Classification 

Conducting ordinal classification via binary classifiers is very easy to implement as long as one
has access to an efficient binary classifier. There are many options, such as Support Vector
Machine (SVM; Cortes and Vapnik, 1995, Vapnik, 1998, Cristianini and Shawe-Taylor, 2000),
Distance Weighted Discrimination (DWD; Marron et al., 2007, Qiao et al., 2010), hybrids
of the two (Qiao and Zhang, 2015a,b), $\psi$-learning (Shen et al., 2003), Large-Margin Unified
Machines (Liu et al., 2011) and so on.

However, because the $(K - 1)$ classification boundaries are trained separately, it is possible
that they cross each other. Figure 2 is a cartoon which shows the possible crossing
between classification boundaries. Here there are four classes (annotated as 1, 2, 3 and
4) and three estimated classification boundaries (I, II and III). The second and the third
estimated boundaries cross each other. Consequently, the red star point cannot be
classified properly. In particular, it will be classified by classifier I to {2,3,4}, by classifier
II to {1,2} and by classifier III to {4}. The intersection of all three prediction sets is empty.
Although one may argue that this point might be Class 2 or Class 4, no definite answer can
be given, and there is an ambiguity as to how to classify this red star point.

Figure 2: A cartoon showing the possible crossing between estimated classification boundaries.
Four classes of data (annotated as 1, 2, 3 and 4) with three estimated classification boundaries (I,
II and III). Their true noncrossing boundaries are implied by their locations and are not shown.
The second and the third estimated boundaries cross with each other. Consequently, the red star
point cannot be classified properly.

Hence, it is desired that the estimated classification boundaries do not cross each
other. Let $f_k(x)$ be the discriminant function for the $k$th binary classification. Recall
that its boundary is defined by $\{x : f_k(x) = 0\}$. For these boundaries to be noncrossing,
mathematically, it is equivalent that for all $x \in S$ not on any boundary, where $S$ is a subset
of $\mathbb{R}^d$, there exists $k \in \{1, 2, \ldots, K - 1\}$ such that $f_\ell(x) > 0$ for all $\ell \le k$ and $f_\ell(x) < 0$ for
all $\ell > k$. Let $S(x, k) = \mathrm{sign}\{f_k(x)\}$. Then the condition above is the same as that $S(x, k)$
is a monotonically decreasing function with respect to $k$ for any fixed $x \in S$,

$$S(x, 1) \ge S(x, 2) \ge \cdots \ge S(x, K - 1). \qquad (1)$$
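Condition (1) is easy to check at a single point once the $K - 1$ discriminant values are available. The sketch below is a minimal illustration; the helper names and the numerical values are assumptions and do not come from the article.

```python
def sign(value):
    """Return -1, 0 or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def violates_condition_1(f_values):
    """True if sign(f_k(x)) increases somewhere along k = 1, ..., K-1,
    i.e. if the monotone-sign condition (1) fails at this x."""
    signs = [sign(v) for v in f_values]
    return any(left < right for left, right in zip(signs, signs[1:]))


# Assumed discriminant values f_1(x), f_2(x), f_3(x) at two points.
consistent_point = [0.8, -0.3, -1.2]   # signs +, -, -
red_star_point = [0.8, -0.3, 0.5]      # signs +, -, +
print(violates_condition_1(consistent_point), violates_condition_1(red_star_point))
```

The second point mimics the red star of Figure 2, where the second classifier votes negative and the third votes positive.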


3.1 Direct NORDIC 

The noncrossing condition (1) can be fairly difficult to implement. We first consider a sufficient
condition in this subsection. In this article, we use SVM as the basic binary classifier.
For a Mercer kernel function $K(\cdot, \cdot)$, the Representer Theorem (Kimeldorf and Wahba, 1971)
allows the $k$th classification function to be represented by $f_k(x) = \sum_{j=1}^{n} \omega_{k,j} K(x_j, x) + b_k$.
Note that if we add the constraints that

$$\omega_{k,i} \ge \omega_{k+1,i} \ \text{ and } \ b_k \ge b_{k+1} \quad \text{for } k = 1, \ldots, K - 2,$$

then as long as the kernel function is always nonnegative with $K(x, x') \ge 0$ (which is true
for many kernel functions such as the Gaussian radial basis function kernel), we will have
$f_k(x) \ge f_{k+1}(x)$, and hence $S(x, k) \ge S(x, k + 1)$ for any $x \in S$.
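The sufficiency argument can be checked numerically: with a nonnegative kernel, coefficient-wise ordering of $\omega$ and ordering of $b$ force $f_k(x) \ge f_{k+1}(x)$ everywhere. The sketch below uses randomly generated, hypothetical coefficients and a Gaussian kernel with an assumed bandwidth parameterization; it is an illustration, not the article's implementation.

```python
import math
import random


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian RBF kernel; always nonnegative. The sigma parameterization is assumed."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def discriminant(x, train, omega, b):
    """f(x) = sum_j omega_j K(x_j, x) + b."""
    return sum(w * gaussian_kernel(x_j, x) for w, x_j in zip(omega, train)) + b


random.seed(0)
train = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(5)]

# Hypothetical coefficients for two adjacent classifiers, built so that
# omega_{1,i} >= omega_{2,i} for every i and b_1 >= b_2.
omega_1 = [random.gauss(0, 1) for _ in train]
omega_2 = [w - random.random() for w in omega_1]
b_1, b_2 = 0.3, -0.1

test_points = [[random.uniform(-3, 3), random.uniform(-3, 3)] for _ in range(200)]
gaps = [
    discriminant(x, train, omega_1, b_1) - discriminant(x, train, omega_2, b_2)
    for x in test_points
]
min_gap = min(gaps)
print(min_gap)
```

Because every kernel value is nonnegative, each gap is at least $b_1 - b_2$, so the two boundaries cannot cross at any of the sampled points.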


Hence we consider a direct approach to NORDIC, called NORDIC-0, by solving the
following joint optimization problem with the extra noncrossing constraints (3)-(4):

$$\min_{\omega, b} \ \sum_{k=1}^{K-1} \left[ C \sum_{i=1}^{n} \{1 - y_i^{(k)} f_k(x_i)\}_+ + \frac{1}{2} \omega_{k\cdot}^T K \omega_{k\cdot} \right], \qquad (2)$$

where $f_k(x_i) = \sum_{j=1}^{n} \omega_{k,j} K(x_j, x_i) + b_k$, the coefficient vector for the $k$th function is $\omega_{k\cdot} =
(\omega_{k,1}, \ldots, \omega_{k,n})^T$, and $K$ is an $n$ by $n$ matrix whose $(i, j)$th entry is $K_{ij} = K(x_i, x_j)$, subject
to

$$b_k \ge b_{k+1}, \quad \text{for } k = 1, \ldots, K - 2, \qquad (3)$$

$$\omega_{k,i} \ge \omega_{k+1,i}, \quad \text{for } i = 1, \ldots, n, \ k = 1, \ldots, K - 2. \qquad (4)$$

Here $\omega_{k\cdot}^T K \omega_{k\cdot}$ is the regularization term for the $k$th discriminant function.

The term inside the square bracket of (2) is the objective function of kernel SVM corresponding
to the $k$th classifier. We try to minimize the sum of these $(K - 1)$ objective
functions with the extra noncrossing constraints (3)-(4).
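To make the joint objective concrete, the following sketch evaluates the sum of the $(K - 1)$ kernel SVM objectives for given coefficients. It assumes the usual SVM form, a cost $C$ times the hinge losses plus half the quadratic penalty; the function names and the toy Gram matrix are hypothetical, and no optimization is performed.

```python
def dummy_label(y, k):
    """Dummy label for the k-th subproblem: -1 if y <= k, +1 if y > k."""
    return -1 if y <= k else 1


def hinge(margin):
    """Hinge loss {1 - margin}_+."""
    return max(0.0, 1.0 - margin)


def joint_objective(gram, labels, omega, b, C):
    """Sum over k of C * sum_i hinge(y_i^(k) f_k(x_i)) + 0.5 * omega_k' K omega_k.

    gram   : n x n kernel matrix as a list of lists
    labels : ordinal labels in {1, ..., K}
    omega  : list of K-1 coefficient vectors, b : list of K-1 intercepts
    """
    n = len(labels)
    total = 0.0
    for k in range(1, len(omega) + 1):
        w, b_k = omega[k - 1], b[k - 1]
        f = [sum(w[j] * gram[j][i] for j in range(n)) + b_k for i in range(n)]
        loss = sum(hinge(dummy_label(labels[i], k) * f[i]) for i in range(n))
        penalty = 0.5 * sum(w[i] * gram[i][j] * w[j] for i in range(n) for j in range(n))
        total += C * loss + penalty
    return total


# Toy example with an identity Gram matrix and K = 3 (all values assumed).
gram = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labels = [1, 2, 3]
zero_fit = joint_objective(gram, labels, [[0.0] * 3, [0.0] * 3], [0.0, 0.0], C=1.0)
separating_fit = joint_objective(
    gram, labels, [[-1.0, 1.0, 1.0], [-1.0, -1.0, 1.0]], [0.0, 0.0], C=1.0
)
print(zero_fit, separating_fit)
```

The second set of coefficients also satisfies the ordering constraints (3)-(4), so it is a feasible point for the constrained problem in this toy setting.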


3.2 Indirect NORDIC 

The constraints (3)-(4) for NORDIC-0 are sufficient conditions for noncrossing boundaries.
However, such conditions may be too strong. A weaker, but almost sufficient, set of conditions
would be inequality (3) along with the inequality $K\omega_{k\cdot} \ge K\omega_{(k+1)\cdot}$ for $k = 1, \ldots, K - 2$.
Note that they ensure that $f_k(x_i) \ge f_{k+1}(x_i)$ for all the data $x_i$ in the training data set.
Thus when the training data are rich enough to cover the base of $S$, they are almost
sufficient conditions for noncrossing. This approach is an indirect approach to noncrossing
through the training data points, and is called NORDIC-1 in this article. A bonus of this
set of constraints compared to (3)-(4) is that one does not need to take the inverse of $K$
later in the implementation, which we will explain in the next subsection.

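Let $y_{k\cdot} = (y_1^{(k)}, \ldots, y_n^{(k)})^T$ be the dummy class label vector of the $n$ observations for the $k$th
classifier, and $e = (1, \ldots, 1)^T$. For neatness, we let $Y_{k\cdot}$ denote the diagonal matrix with $y_{k\cdot}$
as its diagonal elements, i.e., $Y_{k\cdot} = \mathrm{diag}(y_{k\cdot})$. By replacing the Hinge loss $\{1 - y_i^{(k)} f_k(x_i)\}_+$
in (2) by a slack variable $\xi_{k,i} \ge 0$, and incorporating the new constraints, we can write the
optimization problem for NORDIC-1 as

$$\min \ \sum_{k=1}^{K-1} \left( \frac{1}{2} \omega_{k\cdot}^T K \omega_{k\cdot} + C e^T \xi_{k\cdot} \right), \qquad (5)$$

subject to

$$e - Y_{k\cdot}(K\omega_{k\cdot} + b_k e) \le \xi_{k\cdot}, \quad \text{for } k = 1, \ldots, K - 1, \qquad (6)$$

$$\xi_{k\cdot} \ge 0, \quad \text{for } k = 1, \ldots, K - 1, \qquad (7)$$

$$b_k \ge b_{k+1}, \quad \text{for } k = 1, \ldots, K - 2, \qquad (8)$$

$$K\omega_{k\cdot} \ge K\omega_{(k+1)\cdot}, \quad \text{for } k = 1, \ldots, K - 2. \qquad (9)$$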


3.3 Implementations of NORDIC 


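We start off by deriving the Wolfe dual of the optimization problem for NORDIC-1. The
implementation of NORDIC-0 will become clearer later as a variant of that of NORDIC-1.
We introduce nonnegative Lagrange multipliers $\alpha_{k\cdot} = (\alpha_{k,1}, \ldots, \alpha_{k,n})^T \in \mathbb{R}_+^n$, $\zeta_{k\cdot} =
(\zeta_{k,1}, \ldots, \zeta_{k,n})^T \in \mathbb{R}_+^n$, $\gamma_k \in \mathbb{R}_+$ and $\varphi_{k\cdot} = (\varphi_{k,1}, \ldots, \varphi_{k,n})^T \in \mathbb{R}_+^n$ for the constraints (6),
(7), (8) and (9) respectively. The Lagrangian for the primal problem (5)-(9) is

$$\mathcal{L} = \sum_{k=1}^{K-1} \Big[ \tfrac{1}{2} \omega_{k\cdot}^T K \omega_{k\cdot} + C e^T \xi_{k\cdot}
 + \alpha_{k\cdot}^T \{ e - Y_{k\cdot}(K\omega_{k\cdot} + b_k e) - \xi_{k\cdot} \}
 - \zeta_{k\cdot}^T \xi_{k\cdot} - \gamma_k (b_k - b_{k+1}) 1_{\{k \ne K-1\}}
 - \varphi_{k\cdot}^T (K\omega_{k\cdot} - K\omega_{(k+1)\cdot}) 1_{\{k \ne K-1\}} \Big].$$

It can be rearranged, so that in the square bracket, the subscripts for the primal variables
all have the same index $k$, as follows,

$$\mathcal{L} = \sum_{k=1}^{K-1} \Big[ \tfrac{1}{2} \omega_{k\cdot}^T K \omega_{k\cdot} + C e^T \xi_{k\cdot}
 + \alpha_{k\cdot}^T \{ e - Y_{k\cdot}(K\omega_{k\cdot} + b_k e) - \xi_{k\cdot} \}
 - \zeta_{k\cdot}^T \xi_{k\cdot} - b_k (\gamma_k 1_{\{k \ne K-1\}} - \gamma_{k-1} 1_{\{k \ne 1\}})
 - (\varphi_{k\cdot} 1_{\{k \ne K-1\}} - \varphi_{(k-1)\cdot} 1_{\{k \ne 1\}})^T K \omega_{k\cdot} \Big]. \qquad (10)$$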
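The Karush-Kuhn-Tucker (KKT) conditions for the primal problem require the following:

$$0 = \frac{\partial \mathcal{L}}{\partial \omega_{k\cdot}} = K\omega_{k\cdot} - K Y_{k\cdot} \alpha_{k\cdot} - K(\varphi_{k\cdot} 1_{\{k \ne K-1\}} - \varphi_{(k-1)\cdot} 1_{\{k \ne 1\}}), \qquad (11)$$

$$0 = \frac{\partial \mathcal{L}}{\partial b_k} = -y_{k\cdot}^T \alpha_{k\cdot} - (\gamma_k 1_{\{k \ne K-1\}} - \gamma_{k-1} 1_{\{k \ne 1\}}), \qquad (12)$$

$$0 = \frac{\partial \mathcal{L}}{\partial \xi_{k\cdot}} = C e - \alpha_{k\cdot} - \zeta_{k\cdot}. \qquad (13)$$

Once the KKT conditions (12) and (13) are inserted into (10), the terms that are associated
with $b_k$ and $\xi_{k\cdot}$ are eliminated. Moreover, from (11), we have

$$K\omega_{k\cdot} = K \{ Y_{k\cdot} \alpha_{k\cdot} + (\varphi_{k\cdot} 1_{\{k \ne K-1\}} - \varphi_{(k-1)\cdot} 1_{\{k \ne 1\}}) \},$$

which leads to

$$\omega_{k\cdot} = Y_{k\cdot} \alpha_{k\cdot} + (\varphi_{k\cdot} 1_{\{k \ne K-1\}} - \varphi_{(k-1)\cdot} 1_{\{k \ne 1\}})$$

when $K$ is full rank. Let

$$R = \left[ \ \mathrm{diag}\{Y_{k\cdot}\}_{1 \le k \le K-1} \ \Big| \ I_{n(K-2)}^{(n)} - I_{n(K-2)}^{(-n)} \ \Big| \ 0_{n(K-1) \times (K-2)} \ \right]$$

and $\theta = (\alpha; \varphi; \gamma)$, where for an $m \times n$ matrix $A$, $A^{(s)}$ denotes an $(m + s) \times n$ matrix whose
upper $m$ rows are $A$ and the bottom $s$ rows are all 0, and $A^{(-s)}$ denotes an $(m + s) \times n$ matrix
whose bottom $m$ rows are $A$ and the top $s$ rows are all 0. Summarizing all these conditions,
we can see that the optimality of the primal problem is given by the dual problem,

$$\max_{\theta} \ -\frac{1}{2} \theta^T \{ R^T (I_{K-1} \otimes K) R \} \theta + e^T \alpha,$$

subject to

$$-y_{k\cdot}^T \alpha_{k\cdot} - (\gamma_k 1_{\{k \ne K-1\}} - \gamma_{k-1} 1_{\{k \ne 1\}}) = 0,$$

$$0 \le \alpha_{k\cdot} \le C e, \quad \varphi_{k\cdot} \ge 0, \quad \gamma_k \ge 0,$$

where $\otimes$ is the Kronecker product.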

The dual problem above is nothing but a quadratic programming (QP) problem in
$\alpha_{k\cdot}$, $\varphi_{k\cdot}$ and $\gamma_k$ with equality and bound-inequality constraints, which can be solved by many
third-party off-the-shelf QP subroutines. More efficient implementations, such as Platt’s
SMO (Platt, 1999), are possible, but are not explored here as they are beyond the scope of this
paper.

The optimal primal variables $\omega$ are calculated from the optimal dual variables using
the relation $\omega_{k\cdot} = Y_{k\cdot} \alpha_{k\cdot} + (\varphi_{k\cdot} 1_{\{k \ne K-1\}} - \varphi_{(k-1)\cdot} 1_{\{k \ne 1\}})$. By the KKT complementary
conditions, the bias term $b_k$ for the $k$th classifier can be found from any $x_i$ in the training
data with $0 < \alpha_{k,i} < C$, due to the relation $1 - y_i^{(k)} \{ \sum_{j=1}^{n} \omega_{k,j} K(x_j, x_i) + b_k \} = 0$.
Alternatively, one can fix the $\omega$'s in the primal (5) as known and minimize (5)-(9) with
respect to $b_k$ and $\xi_{k\cdot}$. This would lead to a linear programming problem.
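Recovering $b_k$ from the complementary slackness relation can be sketched as follows. Since $y_i^{(k)} \in \{-1, +1\}$, the relation gives $b_k = y_i^{(k)} - \sum_j \omega_{k,j} K(x_j, x_i)$ for any index with $0 < \alpha_{k,i} < C$. Averaging over all such indices is a common numerical safeguard assumed here, not something stated in the text; the function name, the tolerance and the toy inputs are likewise hypothetical.

```python
def bias_from_margin_points(gram, omega_k, y_k, alpha_k, C, tol=1e-8):
    """Estimate b_k from training points whose dual variable lies strictly in (0, C).

    For such a point, 1 - y_i (g_i + b_k) = 0 with g_i = sum_j omega_j K(x_j, x_i),
    hence b_k = y_i - g_i because y_i is either -1 or +1.
    """
    n = len(y_k)
    estimates = []
    for i in range(n):
        if tol < alpha_k[i] < C - tol:
            g_i = sum(omega_k[j] * gram[j][i] for j in range(n))
            estimates.append(y_k[i] - g_i)
    if not estimates:
        raise ValueError("no training point with 0 < alpha < C")
    # Averaging is an assumed design choice; any single estimate would do in exact arithmetic.
    return sum(estimates) / len(estimates)


# Toy inputs (assumed): identity Gram matrix, both points on the margin.
gram = [[1.0, 0.0], [0.0, 1.0]]
omega_k = [0.7, -1.3]
y_k = [1, -1]
alpha_k = [0.5, 0.5]
b_k = bias_from_margin_points(gram, omega_k, y_k, alpha_k, C=1.0)
print(b_k)
```

In this toy case both margin points give the same value, so the average coincides with either single estimate.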

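The implementation for NORDIC-0 is similar, except that the Lagrangian is

$$\mathcal{L}_0 = \sum_{k=1}^{K-1} \Big[ \tfrac{1}{2} \omega_{k\cdot}^T K \omega_{k\cdot} + C e^T \xi_{k\cdot}
 + \alpha_{k\cdot}^T \{ e - Y_{k\cdot}(K\omega_{k\cdot} + b_k e) - \xi_{k\cdot} \}
 - \zeta_{k\cdot}^T \xi_{k\cdot} - \gamma_k (b_k - b_{k+1}) 1_{\{k \ne K-1\}}
 - \underline{\varphi_{k\cdot}^T (\omega_{k\cdot} - \omega_{(k+1)\cdot})} 1_{\{k \ne K-1\}} \Big].$$

The only difference of the Lagrangian of NORDIC-0 from that of NORDIC-1 is underlined.
Consequently, the KKT conditions are almost the same, except that

$$0 = \frac{\partial \mathcal{L}_0}{\partial \omega_{k\cdot}} = K\omega_{k\cdot} - K Y_{k\cdot} \alpha_{k\cdot} - (\varphi_{k\cdot} 1_{\{k \ne K-1\}} - \varphi_{(k-1)\cdot} 1_{\{k \ne 1\}}).$$

This leads to $\omega_{k\cdot}$ at the optimality being

$$\omega_{k\cdot} = Y_{k\cdot} \alpha_{k\cdot} + K^{-1} (\varphi_{k\cdot} 1_{\{k \ne K-1\}} - \varphi_{(k-1)\cdot} 1_{\{k \ne 1\}}),$$

assuming that $K$ is invertible. The rest of the implementation is identical to that in
NORDIC-1, except that we let

$$R = \left[ \ \mathrm{diag}\{Y_{k\cdot}\}_{1 \le k \le K-1} \ \Big| \ (I_{K-1} \otimes K^{-1}) (I_{n(K-2)}^{(n)} - I_{n(K-2)}^{(-n)}) \ \Big| \ 0_{n(K-1) \times (K-2)} \ \right]$$

and $\theta = (\alpha; \varphi; \gamma)$.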
4 Exact NORDIC via Integer Programming


Recall that the necessary and sufficient condition for noncrossing (1) is that the sign of
$f_k(x)$, $S(x, k)$, is a monotonically decreasing function with respect to $k$ for any fixed $x \in S$,
$S(x, 1) \ge S(x, 2) \ge \cdots \ge S(x, K - 1)$. The constraints for NORDIC-0 and NORDIC-1
that we have discussed in the last section are sufficient to ensure that $f_1(x) \ge f_2(x) \ge \cdots \ge
f_{K-1}(x)$, which ultimately ensures noncrossing. However, they are not the weakest sufficient
conditions we can impose. As a matter of fact, the discriminant functions $f_k$ themselves
need not be monotonically decreasing with respect to $k$ in order for noncrossing to hold. In this
section, we explore an idea which aims for exact noncrossing by imposing conditions on the
sign of the discriminant functions.

For each $x \in S$, exactly one of two alternative situations holds with regard to the
prediction result from a discriminant function $f_k$: either $f_k(x) < 0$ or $f_k(x) \ge 0$. According
to the noncrossing condition (1), the former implies that $f_{k+1}(x) \le 0$ (recall that the sign is
monotonically decreasing in $k$). Thus, the noncrossing condition (1) is logically equivalent
to the condition that at least one of the following two constraints is satisfied,

(i) $f_k(x) \ge 0$, and (ii) $f_{k+1}(x) \le 0$;

i.e., (i) and (ii) cannot both be false. Specifically, if (i) is not true, i.e., if $f_k(x) < 0$, then
(ii) is true. This leads to the noncrossing condition.

Such a logical implication can be modeled by the following Logical Constraints which involve
binary integer variables $z_{1k}, z_{2k} \in \{0, 1\}$,

$$-f_k(x) - M_1 z_{1k} \le 0,$$
$$f_{k+1}(x) - M_2 z_{2k} \le 0,$$
$$z_{1k} + z_{2k} \le 1,$$

where $M_1$ and $M_2$ are two sufficiently large constants. In particular, $z_{1k} + z_{2k} \le 1$
implies that at least one of $z_{1k}$ and $z_{2k}$ has to be zero, hence (considering the first
two constraints) either $-f_k(x) \le 0$ or $f_{k+1}(x) \le 0$, or both are true; this is the noncrossing
condition discussed above. Note that if both $z_{1k}$ and $z_{2k}$ were 1, then the first two constraints
would become $-f_k(x) \le M_1$ and $f_{k+1}(x) \le M_2$, which would essentially impose no constraint on
$f_k(x)$ and $f_{k+1}(x)$, so that the undesired case that $f_k(x) < 0$ and $f_{k+1}(x) > 0$ may occur. See
Bradley et al. (1977) for an introduction to integer programming. We can use this technique
to model the noncrossing constraints. In particular, we seek to

$$\min \ \sum_{k=1}^{K-1} \left[ \sum_{i=1}^{n} \{1 - y_i^{(k)} f_k(x_i)\}_+ + \lambda \|\omega_{k\cdot}\|_1 \right], \qquad (14)$$

subject to

$$-f_k(x_i) - M_1 z_{1ik} \le 0, \qquad (15)$$

$$f_{k+1}(x_i) - M_2 z_{2ik} \le 0, \qquad (16)$$

$$z_{1ik} + z_{2ik} \le 1, \qquad (17)$$

$$z_{1ik}, z_{2ik} \in \{0, 1\}, \qquad (18)$$

for $i = 1, 2, \ldots, n$, and $k = 1, \ldots, K - 2$. This method is referred to as NORDIC-2 in this
article. Here the constraints (15)-(18) are almost sufficient and (exactly) necessary conditions
for noncrossing. They are again not exactly sufficient because we impose the constraints on
the training data vectors only, instead of on all $x \in S$, similar to the case of NORDIC-1. However,
again, if the data vectors in the training data are rich enough, noncrossing across the board
can be expected. These conditions are weaker than those in NORDIC-0 and NORDIC-1
because they ensure the monotonicity of the sign of $f_k$, rather than the value of $f_k$ itself.
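The big-M logic can be verified by brute force over the binary variables. The sketch below checks, for given values of $f_k(x_i)$ and $f_{k+1}(x_i)$, whether any admissible pair of binary variables satisfies the three inequality constraints. The function name and the value of the constants are assumptions; the constants only need to exceed the magnitude of the discriminant values.

```python
from itertools import product


def logical_constraints_feasible(f_k, f_k_plus_1, big_m_1=1e3, big_m_2=1e3):
    """True if some (z1, z2) in {0,1}^2 with z1 + z2 <= 1 satisfies
    -f_k - M1*z1 <= 0 and f_{k+1} - M2*z2 <= 0."""
    for z1, z2 in product((0, 1), repeat=2):
        if z1 + z2 > 1:
            continue
        if -f_k - big_m_1 * z1 <= 0 and f_k_plus_1 - big_m_2 * z2 <= 0:
            return True
    return False


# The four sign patterns of (f_k, f_{k+1}); values are illustrative.
print(logical_constraints_feasible(1.0, 2.0))    # both positive
print(logical_constraints_feasible(1.0, -1.0))   # positive then negative
print(logical_constraints_feasible(-1.0, -2.0))  # both negative
print(logical_constraints_feasible(-1.0, 1.0))   # negative then positive
```

Only the last pattern, a negative $f_k$ followed by a positive $f_{k+1}$, is ruled out, which is exactly the sign increase that condition (1) forbids.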

Note that the objective function of NORDIC-2 is a little different from those of NORDIC-0
and NORDIC-1, especially in the use of the $L_1$ norm penalty. We choose not to use the
more common $L_2$ penalty, which leads to a quadratic objective function in SVM, because
it is rather difficult to solve a mixed integer programming problem with a quadratic objective
function. In fact, we are not aware of an efficient off-the-shelf freeware package which
solves such a problem. In order to show the usefulness of the new noncrossing constraints,
which is the main point of this article, we choose to use the $L_1$ penalty for computational
simplicity.
It is worth noting that so long as there is an efficient mixed integer programming package
which is capable of dealing with quadratic objective functions, an extension will be very
natural and readily available.

Indeed, integer programming can solve such nonstandard problems which traditional optimization
methods such as QP or linear programming cannot. However, integer programming
has been overlooked by statisticians for a long time (probably due to its high computational
cost and the few statistical problems to which this method applies). To the author’s best knowledge,
this article is one of only a few works in the statistical literature which employ the integer
programming technique. See Liu and Wu (2006) for another instance which uses mixed
integer programming to solve a statistical problem.

5 Theoretical Properties 

In this section, we study two aspects of the theoretical properties of NORDIC. The first
subsection is about the Bayes rule and Fisher consistency of the loss function in ordinal
classification. The second one pertains to the asymptotic normality of the NORDIC solution.

5.1 Bayes rules and Fisher Consistency 

For binary classification, a classifier with loss $V_1(yf(x)) : \mathbb{R} \to \mathbb{R}_+$ is Fisher consistent if the
minimizer of $E[V_1(Yf(X)) \mid X = x]$ has the same sign as $P(Y = 1 \mid X = x) - 1/2$. The latter
is the Bayes rule for binary classification. Intuitively, Fisher consistency requires that the
classifier yields the Bayes decision rule asymptotically. See Lin (2004) for Fisher consistency
of binary large margin classifiers.

In multicategory classification, a classifier with loss function $V_2(y, f(x)) : \mathbb{R} \times \mathbb{R}^K \to \mathbb{R}_+$,
where $f(x) : S \to \mathbb{R}^K$ is the vector of $K$ discriminant functions, is Fisher consistent if the minimizer of
$E[V_2(Y, f(X)) \mid X = x]$, $g^*(x) = (g_1^*(x), \ldots, g_K^*(x))^T$, satisfies $\arg\max_{k \in \{1, \ldots, K\}} g_k^*(x) =
\arg\max_{k \in \{1, \ldots, K\}} \eta_k(x)$. Here, $\arg\max_{k \in \{1, \ldots, K\}} \eta_k(x)$ is the Bayes classification rule for
multicategory classification. See, for example, Liu (2007) for some discussions on Fisher
consistency for multicategory SVM classifiers.

Below we formally define the Bayes rule and Fisher consistency for ordinal classification.
The Bayes rule for ordinal classification is $\phi^{\mathrm{ord}}_{\mathrm{Bayes}}(x) = k$ where $k$ is such that $\sum_{\ell=1}^{k-1} \eta_\ell(x) <
1/2$ and $\sum_{\ell=1}^{k} \eta_\ell(x) \ge 1/2$. This rule guarantees that each component binary classification
in ordinal classification yields the Bayes rule in the binary sense.


Definition 1. (Generalized Fisher consistency for ordinal classification) An ordinal classification
method with loss function $V_3(\cdot, \cdot)$ is Generalized Fisher consistent if for any $x$, the
minimizer $f^*(x) = (f_1^*(x), \ldots, f_{K-1}^*(x))^T$ of

$$E\left[ \sum_{k=1}^{K-1} V_3(Y^{(k)}, f_k(X)) \,\Big|\, X = x \right]$$

satisfies $\mathrm{sign}(f_k^*(x)) = \mathrm{sign}(1/2 - \sum_{\ell=1}^{k} \eta_\ell(x))$ for $k = 1, \ldots, K - 1$. Here $Y^{(k)}$ is the
dummy class label for Class $Y$ in the $k$th binary classification subproblem.


Generalized Fisher consistency means that the rule given by the $(K - 1)$ discriminant functions
$f_1^*(x), \ldots, f_{K-1}^*(x)$, jointly trained under the loss function $V_3$, is essentially the same as the
Bayes rule $\phi^{\mathrm{ord}}_{\mathrm{Bayes}}(x)$ as $n \to \infty$. Note that $\phi^{\mathrm{ord}}_{\mathrm{Bayes}}(x)$ has the smallest risk with respect to the aggregated 0-1 loss
for the $(K - 1)$ binary subproblems. Hence it is also the one which has the smallest risk
under the so-called distance loss, defined as $L(\phi, y) = |\phi - y|$ (see Qiao, 2015).

Because of the use of the Hinge loss function for SVM (which is Fisher consistent in the
binary sense), our NORDIC method is Generalized Fisher consistent for ordinal classification.

The proof is omitted.


5.2 Asymptotic Normality of Linear NORDIC 


When the kernel function is $K(x_1, x_2) = x_1^T x_2$, that is, the linear kernel, we have the
following linear NORDIC classifier, with the objective function

$$\sum_{k=1}^{K-1} \left[ \frac{1}{n} \sum_{i=1}^{n} \{1 - y_i^{(k)} (x_i^T \omega_{k\cdot} + b_k)\}_+ + \frac{\lambda}{2} \|\omega_{k\cdot}\|_2^2 \right], \qquad (19)$$

and one of the two following sets of constraints that correspond to NORDIC-1 and NORDIC-2
respectively,

$$x_i^T \omega_{k\cdot} + b_k \ge x_i^T \omega_{(k+1)\cdot} + b_{k+1},$$

and

$$-(x_i^T \omega_{k\cdot} + b_k) - M_1 z_{1ik} \le 0,$$
$$(x_i^T \omega_{(k+1)\cdot} + b_{k+1}) - M_2 z_{2ik} \le 0,$$
$$z_{1ik} + z_{2ik} \le 1,$$
$$z_{1ik}, z_{2ik} \in \{0, 1\},$$

for $i = 1, 2, \ldots, n$ and $k = 1, \ldots, K - 2$.

Because the linear kernel can be negative, the NORDIC-0 method cannot be directly extended
to the linear kernel case. We can use the technique in Liu and Wu (2011) to create
a new kernel that satisfies the nonnegativity assumption essential for NORDIC-0. In this
subsection, we prove the asymptotic normality of linear NORDIC.

Koo et al. (2008) provided a Bahadur representation of the linear SVM and proved its
asymptotic normality under some conditions. In particular, they have shown that
$(\hat\omega, \hat b)^T - (\omega^\circ, b^\circ)^T = O_p(n^{-1/2})$, where
$(\hat\omega, \hat b)$ is the solution to the SVM classifier and $(\omega^\circ, b^\circ)$ is
the minimizer of the expected loss function.

Theorem 1 below shows that the constrained NORDIC solution has the same limiting
distribution as the unconstrained binary SVM classifiers. To prove this result, we need
all the regularity conditions in Koo et al. (2008).


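Theorem 1. For $1 \le k \le K-1$, let $(\tilde\omega_{k\cdot}, \tilde b_k)$ and
$(\hat\omega_{k\cdot}, \hat b_k)$ be the constrained and unconstrained solutions,
respectively, to the $k$th binary linear SVM problem in (19). Assume that the regularity
conditions in Koo et al. (2008) are satisfied for $k$. Then for any $u \in \mathbb{R}^{d+1}$,

$$P\left[n^{1/2}\left\{(\tilde\omega_{k\cdot}, \tilde b_k)^T - (\omega^\circ_{k\cdot}, b^\circ_k)^T\right\} \le u\right] - P\left[n^{1/2}\left\{(\hat\omega_{k\cdot}, \hat b_k)^T - (\omega^\circ_{k\cdot}, b^\circ_k)^T\right\} \le u\right] \to 0,$$

so that the constrained solution has the same limiting distribution as the classical
unconstrained solution.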


Based on Theorem 1, inference for the constrained NORDIC can be obtained by applying
the known asymptotic results for binary linear SVM, through the unconstrained NORDIC
solutions. For example, we can show the asymptotic normality of the coefficients of the SVM
components in linear NORDIC in the same way as in Koo et al. (2008).


6 Numerical Results 

We compare NORDIC-0, NORDIC-1, NORDIC-2, the vanilla ordinal classification method
that uses $(K-1)$ separately trained binary SVM classifiers (Frank and Hall, 2001)
(BSVM), the data replication method by Cardoso and da Costa (2007) (DR) and the
parallel discriminant hyperplane method by Chu and Keerthi (2005) (CK). We use our own
experimental codes in the R environment to implement these methods. The Gaussian radial
basis function kernel is used for all classifiers. The kernel parameter is tuned among the
10%, 50% and 90% quantiles of the pairwise distances between training vectors. The tuning
parameters are selected from a grid of powers of 2.
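
The quantile-based choice of kernel parameter candidates can be sketched as follows. This is only an illustration in Python (the experiments themselves were run in R): the function names are hypothetical, and the parametrization $\exp\{-\|a-b\|^2/(2\sigma^2)\}$ of the Gaussian kernel is an assumption, since the text does not state how the quantile enters the kernel.

```python
import math
import random
import statistics
from itertools import combinations

def bandwidth_candidates(X):
    """10%, 50% and 90% quantiles of the pairwise Euclidean distances."""
    dists = [math.dist(a, b) for a, b in combinations(X, 2)]
    deciles = statistics.quantiles(dists, n=10, method="inclusive")
    return deciles[0], deciles[4], deciles[8]

def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel; the exp(-d^2 / (2 sigma^2)) form is an assumption."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2))

random.seed(0)
# Hypothetical training vectors: 50 points in 5 dimensions.
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(50)]
q10, q50, q90 = bandwidth_candidates(X)
print(q10, q50, q90)
```

Each of the three candidates would then be compared on the tuning set, together with the grid for the regularization parameter.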

6.1 Nonlinear Three-class Examples 

We consider a data setting with three classes and $d$ variables: $X_1, X_2, \ldots, X_d$, where

• $X_1 = \tilde X_1 + \sigma N(0, 1)$ and $\tilde X_1 \sim \mathrm{Uniform}(-3, 3)$,

• $X_2 = \tilde X_2 + \sigma N(0, 1)$ and $\tilde X_2 \sim \mathrm{Uniform}(-6, 6)$,

• and $X_3, \ldots, X_d \sim N(0, 1)$.

Here, $\tilde X_1$ and $\tilde X_2$ truly determine the class labels (see below) but only their
contaminated counterparts $X_1$ and $X_2$ are observed. In particular, let

$$f_1 = -2\tilde X_1 + 0.2\tilde X_1^2 - 0.1\tilde X_2^2 + 0.2,$$
$$f_2 = -0.4\tilde X_1^2 + 0.2\tilde X_2^2 - 0.4,$$
$$f_3 = 2\tilde X_1 + 0.2\tilde X_1^2 - 0.1\tilde X_2^2 + 0.2.$$

We assign each observation to class $k$ with probability proportional to $\exp(f_k)$ for $k = 1, 2, 3$.
We generate 100 data points in the training set, 100 in the tuning set and 10000 in the test set.
The standard deviation of the measurement error, $\sigma$, takes the values 0.5, 1 and 1.5. When $d = 5$
and $\sigma = 0$ (no perturbation), this is the same example as the nonlinear example in Zhang
et al. (2008). However, we perturb the data and increase the dimension ($d = 10, 20, \ldots, 50$)
to make the problem more challenging.
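
A minimal simulation sketch of this data-generating mechanism is given below, in Python for illustration. The function names are hypothetical, and the coefficients of the three scores are written as in the nonlinear example of Zhang et al. (2008); they should be checked against that reference before reuse.

```python
import math
import random

def latent_scores(t1, t2):
    """Scores f_1, f_2, f_3 at the uncontaminated covariates (assumed coefficients)."""
    f1 = -2 * t1 + 0.2 * t1 ** 2 - 0.1 * t2 ** 2 + 0.2
    f2 = -0.4 * t1 ** 2 + 0.2 * t2 ** 2 - 0.4
    f3 = 2 * t1 + 0.2 * t1 ** 2 - 0.1 * t2 ** 2 + 0.2
    return [f1, f2, f3]

def simulate(n, d, sigma, rng):
    """Draw n observations with d covariates and measurement-error level sigma."""
    X, y = [], []
    for _ in range(n):
        t1, t2 = rng.uniform(-3, 3), rng.uniform(-6, 6)
        scores = latent_scores(t1, t2)
        m = max(scores)
        # Class probabilities proportional to exp(f_k); subtracting m avoids overflow.
        weights = [math.exp(s - m) for s in scores]
        label = rng.choices([1, 2, 3], weights=weights)[0]
        # Only the contaminated versions of the first two covariates are recorded.
        row = [t1 + sigma * rng.gauss(0, 1), t2 + sigma * rng.gauss(0, 1)]
        row += [rng.gauss(0, 1) for _ in range(d - 2)]
        X.append(row)
        y.append(label)
    return X, y

rng = random.Random(1)
X_train, y_train = simulate(100, 10, 0.5, rng)
print(len(X_train), len(X_train[0]), sorted(set(y_train)))
```

The same function can be called again to draw the tuning set (100 points) and the test set (10000 points).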

Note that this example was initially designed by Zhang et al. (2008) as a regular
multicategory classification problem, instead of an ordinal classification one. Figure 3
shows a sample realization of the data without perturbation in the first two dimensions.
In a general sense, Class 2 can be viewed as lying in the middle of Class 1 and Class 3.
We pretend that the class labels are of an ordinal nature and compare different ordinal
classification methods.

[Figure 3 plot. Title: First two dimensions of a sample (n=500) with zero noise. Axis label: X1.]

Figure 3: Nonlinear three-class examples: A scatter plot showing the first two dimensions of a
realization with no additional error added. The Bayes rule is also shown.

[Figure 4 plot. Legend (method): DR, CK, BSVM, NORDIC-0, NORDIC-1, NORDIC-2.]

Figure 4: Nonlinear three-class examples: The top row shows the error rate for different methods
(in different line types) with 3 noise levels (the left, middle and right panels) and 5 different
dimensions (shown on the horizontal axis of each subfigure). The bottom row shows the computational
time. In general, a NORDIC method is better than a non-NORDIC method for this example.

Figure 4 summarizes the results over 100 simulations. NORDIC-0 and NORDIC-1 are the
better classifiers in terms of classification performance when the dimension is
small. For higher dimensions, the NORDIC-2 method is better than the other methods. The
DR method is the most computationally costly and the CK method is the most efficient one.
The reason that NORDIC works here is probably the perturbation that is added to
this data set. A NORDIC method, with the help of the noncrossing constraints, can borrow
strength from different classes and become more robust to perturbation.
[Figure 5 plot. Title: First two dimensions of a sample (n=200) w/o perturbation. Axis label: X1.]

Figure 5: Donut examples: A scatter plot showing the first two dimensions of a realization, with
no perturbation added. The natural boundaries between classes are also shown.

6.2 Donut Examples 

We now consider a more challenging setting, which is tailored toward ordinal data. We
first generate data points uniformly from a 2-dimensional plate with radius 4, and label them
as Class 1, except for those within a circle centered at $(1.9, 0)^T$ with radius 2,
which are labeled as Class 2, and those within a circle centered at $(\sqrt{3}+0.1, 0)^T$ with
radius $\sqrt{3}$, which are labeled as Class 3. The observations for the additional $(d-2)$ dimensions
are all 0. We then perturb all the data points by adding independent $d$-dimensional Gaussian
random vectors from $N_d(0, \sigma I)$. We let $\sigma = 0.2$, 0.4 and 0.6 and let $d$ range from
10 to 75. Figure 5 shows one realization of the data on the first two dimensions without the
perturbation and the natural boundaries between the classes. This generalizes the classic
donut examples in nonlinear classification.
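
The construction can be sketched in a few lines of Python, shown here only as an illustration. The function names are hypothetical, and the noise parameter is treated as the per-coordinate standard deviation, which is an assumption about the notation. Because the inner circle has area $3\pi$ and the outer circle has area $4\pi$ on a plate of area $16\pi$, the expected class shares are 18.75% for Class 3 and 6.25% for Class 2, which the simulation should approximately reproduce.

```python
import math
import random

def donut_label(u, v):
    """Class of a noise-free point (u, v) on the plate of radius 4."""
    if math.hypot(u - (math.sqrt(3) + 0.1), v) <= math.sqrt(3):
        return 3                      # inner circle
    if math.hypot(u - 1.9, v) <= 2.0:
        return 2                      # thin region between the two circles
    return 1

def simulate_donut(n, d, sigma, rng):
    """Uniform points on the plate, padded with zeros, then perturbed."""
    X, y = [], []
    while len(X) < n:
        u, v = rng.uniform(-4, 4), rng.uniform(-4, 4)
        if math.hypot(u, v) > 4:      # rejection step: keep points on the plate only
            continue
        y.append(donut_label(u, v))   # labels come from the noise-free coordinates
        clean = [u, v] + [0.0] * (d - 2)
        X.append([c + sigma * rng.gauss(0, 1) for c in clean])
    return X, y

rng = random.Random(2)
X, y = simulate_donut(20000, 10, 0.2, rng)
share2 = y.count(2) / len(y)
share3 = y.count(3) / len(y)
print(round(share2, 3), round(share3, 3))
```
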

Note that Class 2 is sandwiched by Class 1 and Class 3 from both outside and inside,
and the high density region for Class 2 is very thin due to the construction. Hence, one
can expect that a Class 2 observation is very difficult to classify correctly. The
noncrossing constraints here may be of some help because the boundary between Classes 1
and 2 may boost the estimation of the boundary between Classes 2 and 3, and vice versa.
The simulation results are reported in Figure 6. The first row shows the test error over
100 simulations. It appears that in many cases the DR method is the best. However, recall
that in this data set the three classes are highly imbalanced in terms of their sample sizes. On
average, there are only 6.25% Class 2 points and 18.75% Class 3 points. A more reasonable
measure here is a weighted error rate that incorporates the different costs of
misclassification. Here we report the weighted error with the following configuration:

• each misclassified point from Class 1 costs 1;

• each misclassified point from Class 2 to either Class 1 or Class 3 costs 2;

• each misclassified point from Class 3 to Class 2 costs 1, and from Class 3 to Class 1
costs 3.

Such an assignment of the costs reflects the protection for Class 2, and the additional penalization
for misclassifying across two boundaries (the cost for misclassifying from Class 3 to Class 1
is the sum of the costs for misclassifying from 3 to 2 and from 2 to 1).
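
The cost configuration above can be written as a small cost table. The sketch below is illustrative: the names are hypothetical, and dividing the total cost by the number of test points is an assumption, since the normalization of the weighted error rate is not spelled out.

```python
# COST[true][pred]: cost of classifying a point from class `true` as class `pred`.
COST = {
    1: {1: 0, 2: 1, 3: 1},
    2: {1: 2, 2: 0, 3: 2},
    3: {1: 3, 2: 1, 3: 0},
}

def weighted_error(y_true, y_pred, cost=COST):
    """Average misclassification cost per point (normalization by n is an assumption)."""
    total = sum(cost[t][p] for t, p in zip(y_true, y_pred))
    return total / len(y_true)

# Hypothetical labels and predictions for six test points.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 1, 2, 2, 1]
print(weighted_error(y_true, y_pred))
```

In this toy example the total cost is 0 + 1 + 2 + 0 + 1 + 3 = 7 over six points.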

The second row of Figure 6 reports the weighted error rate. It is clear that, except for
NORDIC-2, which is the best in this case, all other methods are more or less the same in
terms of the weighted error. Interestingly, the NORDIC-0 and NORDIC-1 methods do not
perform as well as their sibling NORDIC-2. They perform comparably to the other methods
(they may have a very small advantage over the CK and DR methods when the perturbation is
small, for example, when $\sigma = 0.2$ and 0.4). Recall that NORDIC-0 and NORDIC-1 impose
stronger constraints which aim for the monotonicity of the discriminant function $f_k(x)$ itself,
as opposed to its sign. In contrast, the constraint from NORDIC-2 is much lighter, which
may have left enough "degrees of freedom" to optimize the generalization performance.

[Figure 6 plot. Legend (method): DR, CK, BSVM, NORDIC-0, NORDIC-1, NORDIC-2. Panel titles: sigma = 0.2, sigma = 0.4, sigma = 0.6.]

Figure 6: HD donut examples: The top row shows the error rate for different methods (in different
line types) over 12 experiments with 3 noise levels (the left, middle and right panels) and 5 different
dimensions (shown on the horizontal axis of each subfigure). The bottom row shows the weighted
error rate. NORDIC-2 is the best classifier in terms of classification performance.

One may argue that the choice of the costs in the weighted error may be arbitrary. In
this case, it may be helpful to look into the confusion matrix to see the cause of the different
performance. Figure 7 depicts the 3x3 confusion matrices for the case with contamination
$\sigma = 0.4$ for different methods and different dimensions. For the $(k, \ell)$th plot in the array,
the reported value is the proportion of observations from Class $\ell$ that are classified to Class
$k$ ($\ell, k = 1, 2, 3$). Note that the three plots in the same column sum to
1. A good classifier is expected to have high rates in the diagonal plots and low rates in the
off-diagonal plots. There are, on average, 7496.2 observations from Class 1, and almost all
the methods classify them correctly. Class 2 (with only 626.5 observations) is clearly a very
difficult class. Even our NORDIC-2 has a poor classification accuracy of 25%. That said,
NORDIC-2 shows more advantages for higher dimensional cases. For Class 3, NORDIC-2
shows improved accuracy, especially with far fewer misclassifications into Class 1 (shown
in the upper-left plot).

[Figure 7 plot. Legend (method): BSVM, CK, DR, NORDIC-0, NORDIC-1, NORDIC-2. Vertical axis ticks: 0.0, 0.5, 1.0.]

Figure 7: Confusion matrices for the examples with $\sigma = 0.4$ for different methods and dimensions.

The computational time results are similar to what we have seen for the last example 
and are not reported here. 

7 Real Application 

We use the scale balance data set from the UCI Machine Learning Repository (Lichman,
2013) to test the usefulness of the NORDIC method. This data set, studied in Siegler (1976),
was generated to model psychological experimental results. Each example is classified as
having the balance scale tip to the right, tip to the left, or be balanced. The four attributes
are the left weight, the left distance, the right weight, and the right distance. The class
is found by comparing (left-distance * left-weight) with (right-distance * right-weight):
the scale tips toward the side with the greater product, and if they are equal, it is balanced.
There are 625 instances in the data, with 288 tip to the left (L), 49 balanced (B), and 288
tip to the right (R).
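
The labeling rule and the class counts can be checked by direct enumeration. The sketch below assumes that each of the four attributes takes the integer values 1 to 5 (this range is not stated above; it is consistent with the 625 = 5^4 instances), and the function name is hypothetical.

```python
from collections import Counter
from itertools import product

def balance_class(lw, ld, rw, rd):
    """Label by comparing left weight * left distance with right weight * right distance."""
    left, right = lw * ld, rw * rd
    if left > right:
        return "L"
    if left < right:
        return "R"
    return "B"

levels = range(1, 6)   # assumption: every attribute takes the values 1, ..., 5
counts = Counter(balance_class(*row) for row in product(levels, repeat=4))
print(counts["L"], counts["B"], counts["R"])   # prints: 288 49 288
```
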

There is a clear order between the three classes (L, B and R) and hence ordinal
classification methods are appropriate. We randomly select $n$ points from the data set for
training, $n$ for tuning, and the remaining $(625 - 2n)$ for testing. The proportions of the
three classes are preserved when the partitioning is conducted. The random experiment is
repeated 100 times. We consider four cases, where $n = 52$, 79, 125 and 208 respectively.

[Figure 8 plot. Legend (method): DR, CK, BSVM, NORDIC-0, NORDIC-1, NORDIC-2, SVR.]

Figure 8: Weighted error rate for the scale balance data set.

A naive coding of 1, 2 and 3 for these three classes followed by a regression method will
prove to be suboptimal. In particular, in addition to the ordinal classification methods, we
also compare with support vector regression (SVR; Smola and Schölkopf, 2004, implemented
by svm() in the R package e1071) with the Gaussian radial basis function kernel. SVR is applied
to the data with {1,2,3} coding, and the predicted class is obtained by applying the cut-off
values 1.5 and 2.5 to the predicted numerical outcome.
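
The cut-off rule that turns a numerical prediction into a class label can be sketched as follows. The function name is hypothetical, and sending a prediction that falls exactly on a cut-off to the lower class is an assumption.

```python
def class_from_score(score, cutoffs=(1.5, 2.5)):
    """Map a numeric prediction to class 1, 2 or 3 using the two cut-off values."""
    # Count how many cut-offs the score exceeds; ties go to the lower class.
    return 1 + sum(score > c for c in cutoffs)

scores = (0.7, 1.49, 1.5, 2.2, 2.5, 3.4)   # hypothetical regression outputs
print([class_from_score(s) for s in scores])   # prints: [1, 1, 1, 2, 2, 3]
```
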

Figure 8 shows the weighted error rates of different methods over 100 random splits
of the data set and 4 different sample sizes. Here we let a misclassified point from Class 3
to Class 1, or from Class 1 to Class 3, bear a cost of 2; other types of misclassification
cost only 1. All three NORDIC methods are among the best, with NORDIC-2 having a
significant advantage. The other two NORDIC methods are comparable to the DR method,
especially for small sample cases. SVR is the worst classifier in this experiment.

Figure 9 shows the confusion matrices. It can be seen that the poor performance of the
SVR method is probably because it classifies many more instances to Class B, and this may
be due to the arbitrary choice of the cut-off values 1.5 and 2.5. However, one may have no
better way to choose the cut-offs except through another layer of tuning parameter selection.
On the other hand, NORDIC-2 stands out as the best classifier due to its best performance
on Class B among the methods (except for SVR). Note that for Classes L and R, all
methods (except for SVR) perform more or less the same.

8 Concluding Remarks 

In this article, three versions of NORDIC classifiers are proposed to make use of the order
information in classifying ordinal data. All three classifiers train $(K-1)$ binary SVM
classifiers simultaneously with extra constraints to ensure noncrossing among classification
boundaries. The NORDIC-0 and NORDIC-1 methods focus on a sufficient condition for
noncrossing and are solved by QP. The NORDIC-2 method aims for the exact condition for
noncrossing but has to be solved by an integer programming algorithm.

[Figure 9 plot. Title: scale balance: confusion matrix.]

Figure 9: Confusion matrices for the scale balance data set.

Let us turn our attention back to the formulation for NORDIC-0, (2)-(4). Without
the additional constraints (3) and (4), the NORDIC-0 method is the combination of
$(K-1)$ independently trained SVM classifiers (with the common tuning parameter). It
is known that for a single binary SVM classifier, the discriminant function is given by
$f(x) = \sum_{i=1}^{n} \omega_i K(x_i, x) + b$. The coefficients $\omega_i = \alpha_i y_i$ are
calculated by maximizing the following dual problem of SVM,

$$L_{\mathrm{SVM}} = \sum_{i} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$

$$\text{subject to } \sum_{i} \alpha_i y_i = 0, \text{ and } 0 \le \alpha_i \le C. \qquad (20)$$

See, for example, Burges (1998) for a tutorial. The maximization problem above is the dual 
problem of SVM, while our NORDIC-0 method was based on the primal problem of SVM. 
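
For readers who want to experiment with the dual problem, the following sketch evaluates the standard SVM dual objective and checks the feasibility conditions for a given vector of dual variables. The function names and the two-point example are hypothetical, and the code only evaluates the objective; it is not a solver.

```python
def dual_objective(alpha, y, K):
    """sum_i a_i - 1/2 sum_{i,j} a_i a_j y_i y_j K[i][j]."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i a_i y_i = 0 and 0 <= a_i <= C up to a tolerance."""
    balanced = abs(sum(a * t for a, t in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed

# Hypothetical example: two points x = -1 and x = +1 on the line, linear kernel.
xs = [-1.0, 1.0]
y = [-1, 1]
K = [[a * b for b in xs] for a in xs]
alpha = [0.5, 0.5]
omega = [a * t for a, t in zip(alpha, y)]   # coefficients omega_i = alpha_i * y_i
print(dual_objective(alpha, y, K), is_feasible(alpha, y, C=1.0), omega)
```

For this example the objective along feasible vectors of the form $(a, a)$ equals $2a - 2a^2$, which is maximized at $a = 0.5$.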

One may wonder if a dual-based NORDIC is possible. Indeed, a variant of NORDIC
can be viewed as maximizing the sum of $(K-1)$ such objective functions as in (20), with
extra noncrossing constraints that $\omega_{k\cdot} \ge \omega_{(k+1)\cdot}$, that is,
$\alpha_{ik} y_i^{(k)} \ge \alpha_{i(k+1)} y_i^{(k+1)}$ for all $i$. Note that
the constraints are the same as in NORDIC-0 but the objective function is based on the dual
objective function. However, one can show that this formulation ultimately reduces to the
method proposed by Chu and Keerthi (2005), namely, all the $(K-1)$ classifiers share the
same $\omega$ vector. Hence the CK method can be viewed as a special case in the NORDIC family.
Note that in our NORDIC-0 proposal, we focus on the primal formulation. As a consequence,
the resulting boundaries are not parallel to each other, leading to more flexibility.
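
As a brief worked remark on why a shared coefficient vector produces parallel boundaries, suppose (as an illustrative assumption about the form of the discriminant functions) that all $(K-1)$ classifiers use the same coefficients $\omega_i$ and differ only in their intercepts. Then

$$f_k(x) - f_{k+1}(x) = \Bigl(\sum_{i} \omega_i K(x_i, x) + b_k\Bigr) - \Bigl(\sum_{i} \omega_i K(x_i, x) + b_{k+1}\Bigr) = b_k - b_{k+1},$$

which does not depend on $x$. The boundaries are therefore shifted copies of one another, and they cannot cross as long as the intercepts are ordered, $b_k \ge b_{k+1}$.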

The usefulness and efficiency of the proposed methods are supported by the comparison 
with the competitors. Promising results are obtained from simulated and data examples. 
Fisher consistency of the NORDIC method and asymptotic normality of the linear NORDIC 
method further validate the proposed methods. 

There is a natural connection between ordinal classification and ordered logistic
regression. Both methods fully utilize the ordinal class information. Their difference can be viewed
as analogous to the difference between binary SVM and (binary) logistic regression, or that
between multicategory SVM and multinomial logistic regression. It is interesting to explore
the benefit of using machine learning techniques, including NORDIC, over modeling
approaches such as ordered logistic regression. See Lee and Wang (2015) for such a comparison
in the binary case.

We have provided three distinct formulations. They may perform differently on different
types of data sets, both in terms of the generalization error and the computational time;
the derivation of these optimization problems may give insights into which kernels can more
easily admit truly non-crossing boundaries. It is an interesting future research direction to
identify specific kernels for which we can provide truly non-crossing boundaries.


Appendix 


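Proof of Theorem 1

Let $\tilde Z_n$ and $\hat Z_n$ denote
$n^{1/2}\{(\tilde\omega_{k\cdot}, \tilde b_k)^T - (\omega^\circ_{k\cdot}, b^\circ_k)^T\}$ and
$n^{1/2}\{(\hat\omega_{k\cdot}, \hat b_k)^T - (\omega^\circ_{k\cdot}, b^\circ_k)^T\}$,
respectively. Then

$$P[\tilde Z_n \le u] - P[\hat Z_n \le u] = \left\{P[\tilde Z_n \le u \mid \tilde Z_n \ne \hat Z_n] - P[\hat Z_n \le u \mid \tilde Z_n \ne \hat Z_n]\right\} \times P(\tilde Z_n \ne \hat Z_n).$$

Since the first term in the product is bounded by 2, it suffices to show that
$P(\tilde Z_n \ne \hat Z_n) \to 0$. The event $\tilde Z_n \ne \hat Z_n$ is equivalent to the
event that the unconstrained binary linear SVM classifiers have boundaries crossing each
other, that is,

$$\mathrm{sign}\bigl(x^T\hat\omega_{k\cdot} + \hat b_k\bigr) - \mathrm{sign}\bigl(x^T\hat\omega_{(k+1)\cdot} + \hat b_{k+1}\bigr) < 0$$

for some $x \in S$. This is only possible when $x^T\hat\omega_{k\cdot} + \hat b_k < 0$ and
$x^T\hat\omega_{(k+1)\cdot} + \hat b_{k+1} > 0$. We consider their difference

$$n^{1/2}\left\{\bigl(x^T\hat\omega_{k\cdot} + \hat b_k\bigr) - \bigl(x^T\hat\omega_{(k+1)\cdot} + \hat b_{k+1}\bigr)\right\}.$$

This difference can be written as

$$n^{1/2}\left\{\bigl(x^T\hat\omega_{k\cdot} + \hat b_k\bigr) - \bigl(x^T\omega^\circ_{k\cdot} + b^\circ_k\bigr)\right\}
- n^{1/2}\left\{\bigl(x^T\hat\omega_{(k+1)\cdot} + \hat b_{k+1}\bigr) - \bigl(x^T\omega^\circ_{(k+1)\cdot} + b^\circ_{k+1}\bigr)\right\}
+ n^{1/2}\left\{\bigl(x^T\omega^\circ_{k\cdot} + b^\circ_k\bigr) - \bigl(x^T\omega^\circ_{(k+1)\cdot} + b^\circ_{k+1}\bigr)\right\}.$$

Under the regularity conditions, and due to the results in Koo et al. (2008), the first two
terms above are $O_p(1)$. Thus,
$\bigl(x^T\omega^\circ_{k\cdot} + b^\circ_k\bigr) - \bigl(x^T\omega^\circ_{(k+1)\cdot} + b^\circ_{k+1}\bigr) < -C < 0$.
This contradicts the fact that
$\mathrm{sign}\bigl(x^T\omega^\circ_{k\cdot} + b^\circ_k\bigr) \ge \mathrm{sign}\bigl(x^T\omega^\circ_{(k+1)\cdot} + b^\circ_{k+1}\bigr)$,
which holds due to the assumption that the conditional density for each class is positive.
Thus $P(\tilde Z_n \ne \hat Z_n) \to 0$, which completes the proof.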